CJK name detection

ABSTRACT

Aspects directed to name detection are provided. A method includes generating a raw name detection model using a collection of family names and an annotated corpus including a collection of n-grams, each n-gram having a corresponding probability of occurring. The method includes applying the raw name detection model to a collection of semi-structured data to form annotated semi-structured data identifying n-grams identifying names and n-grams not identifying names, and applying the raw name detection model to a large unannotated corpus to form large annotated corpus data identifying n-grams of the large unannotated corpus identifying names and n-grams not identifying names. The method includes generating a name detection model, including deriving a name model using the annotated semi-structured data identifying names and the large annotated corpus data identifying names, deriving a not-name model using the semi-structured data not identifying names, and deriving a language model using the large annotated corpus.

BACKGROUND

This specification relates to name detection, specifically name detection for Chinese, Japanese, and Korean (“CJK”) languages.

Name detection is typically used in natural language processing, for example, automatic speech recognition (ASR), machine translation (MT), optical character recognition (OCR), sentence parsing, non-Roman character input method editor (IME), and web search applications.

Naïve Bayesian classification methods can be used to detect if a sequence of characters “X” identifies a name, depending on the ratio of the probability of “X” identifying a name given its context (e.g., characters occurring before or after “X”) and the probability of “X” not identifying a name given its context. Language models are used to compute these conditional probabilities. A typical statistical language model is a probability measurement of a word or a sequence of characters given its history (e.g., the occurrence of previous word or character sequences in a collection of data). In particular, a conventional n-gram language model, based on a Markov assumption, is used to predict a word or a sequence of characters.

An n-gram is a sequence of n consecutive tokens, e.g., words or characters. An n-gram has an order, which is the number of tokens in the n-gram. For example, a 1-gram (or unigram) includes one token; a 2-gram (or bigram) includes two tokens.

A given n-gram can be described according to different portions of the n-gram. An n-gram can be described as a context and a future token (context, c), where the context has a length n−1 and c represents the future token. For example, the 3-gram “x y z” can be described in terms of an n-gram context and a future token. The n-gram context includes all tokens of the n-gram preceding the last token of the n-gram. In the given example, “x y” is the context. The left most token in the context is referred to as the left token. The future token is the last token of the n-gram, which in the example is “z”. The n-gram can also be described with respect to a right context and a backed off context. The right context includes all tokens of the n-gram following the first token of the n-gram, represented as an (n−1)-gram. In the example above, “y z” is the right context. Additionally, the backed off context is the context of the n-gram less the left most token in the context. In the example above, “y” is the backed off context.
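The decomposition above can be illustrated with a minimal Python sketch. The names `NGramParts` and `describe_ngram` are hypothetical helpers introduced only for this illustration; they are not part of the described system.

```python
from typing import NamedTuple, Sequence, Tuple


class NGramParts(NamedTuple):
    """Hypothetical container for the portions of an n-gram named in the text."""
    context: Tuple[str, ...]             # all tokens preceding the last token
    future_token: str                    # the last token
    left_token: str                      # left most token of the context
    right_context: Tuple[str, ...]       # all tokens following the first token
    backed_off_context: Tuple[str, ...]  # context less its left most token


def describe_ngram(tokens: Sequence[str]) -> NGramParts:
    """Split an n-gram (n >= 2) into the portions described above."""
    tokens = tuple(tokens)
    if len(tokens) < 2:
        raise ValueError("an n-gram with a context needs at least two tokens")
    context = tokens[:-1]
    return NGramParts(
        context=context,
        future_token=tokens[-1],
        left_token=context[0],
        right_context=tokens[1:],
        backed_off_context=context[1:],
    )


# The 3-gram "x y z" from the example above.
parts = describe_ngram(["x", "y", "z"])
```

For the 3-gram “x y z”, `parts.context` holds “x y”, `parts.right_context` holds “y z”, and `parts.backed_off_context` holds “y”, matching the example in the text.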

Each n-gram has an associated probability estimate that is calculated as a function of n-gram relative frequency in training data. For example, a string of L tokens is represented as C₁^(L) = (c₁, c₂, . . . , c_(L)). A probability can be assigned to the string C₁^(L) as:

$P(C_{1}^{L}) = \prod_{i=1}^{L} P(c_{i} \mid c_{1}^{i-1}) \approx \prod_{i=1}^{L} \hat{P}(c_{i} \mid c_{i-n+1}^{i-1}),$

where the approximation is based on a Markov assumption that only the most recent (n−1) tokens are relevant when predicting a next token in the string, and the “^” (hat) notation for P indicates that it is an approximation of the probability function.
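As a rough illustration of the approximation above, the following Python sketch estimates each conditional probability as a relative frequency and multiplies the estimates over the string. The function names, the `<s>` padding symbol, and the toy corpus are assumptions made for this example; no smoothing is applied.

```python
from collections import Counter


def train_counts(corpus, n):
    """Count n-grams and their (n-1)-token contexts in a list of token sequences."""
    ngrams, contexts = Counter(), Counter()
    for tokens in corpus:
        # "<s>" is an assumed start-of-string padding symbol.
        padded = ["<s>"] * (n - 1) + list(tokens)
        for i in range(n - 1, len(padded)):
            context = tuple(padded[i - n + 1:i])
            ngrams[context + (padded[i],)] += 1
            contexts[context] += 1
    return ngrams, contexts


def string_probability(tokens, ngrams, contexts, n):
    """Product of relative-frequency estimates of P(c_i | previous n-1 tokens)."""
    padded = ["<s>"] * (n - 1) + list(tokens)
    prob = 1.0
    for i in range(n - 1, len(padded)):
        context = tuple(padded[i - n + 1:i])
        if contexts[context] == 0:
            return 0.0  # unseen context: the unsmoothed estimate is zero
        prob *= ngrams[context + (padded[i],)] / contexts[context]
    return prob


# Toy corpus of three two-token strings (illustrative data only).
corpus = [["a", "b"], ["a", "c"], ["a", "b"]]
ngrams, contexts = train_counts(corpus, n=2)
p_ab = string_probability(["a", "b"], ngrams, contexts, n=2)
```

With this toy corpus, “a” always starts a string and “b” follows “a” in two of three cases, so `p_ab` is 1 × 2/3.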

In CJK languages, sentences do not have word boundaries. As a result, sentences need to be segmented automatically before the detection of people's names. Therefore, segmentation errors will be propagated to name detection.

CJK names have morphologic laws that can be obtained from large statistics. For example, 300 common Chinese family names cover 99% or more of the population. Female names often contain particular characters, such as those romanized as na, hong, bing, and li. Usually, common given names are independent of family names. For example, if statistics are available for a combination of one family name and a given name, the likelihood that a combination of another family name and the same given name identifies a name can be predicted using the statistics of the other family name identifying a family name and the statistics of the given name identifying a given name. Furthermore, some words in Chinese can either be a person's name or a regular word, e.g., a single character sequence can be either the name of a famous singer in China or a common word meaning daybreak. The detection of such a name largely depends on the context.

In addition, CJK names are generally identified using 2-grams (bigrams) or 3-grams (trigrams). Assuming a horizontal convention of reading CJK text from left to right, the left most character in the context is a family name. The right context is a given name. For example, if “x y z” is a CJK name, then “x” is a family name and “y z” is a given name. As a further example, if “x y” is a CJK name, then “x” is a family name and “y” is a given name.

SUMMARY

Systems, methods, and computer program products for name detection are provided that are particularly useful for detecting names made up of ideographs, e.g., Chinese characters. In general, in one aspect, a method is provided. The method includes generating a raw name detection model using a collection of family names and an annotated corpus including a collection of n-grams, each n-gram having a corresponding probability of occurring as a name in the annotated corpus. The method also includes applying the raw name detection model to a collection of semi-structured data to form annotated semi-structured data, the annotated semi-structured data identifying n-grams identifying names and n-grams not identifying names. The method also includes applying the raw name detection model to a large unannotated corpus to form large annotated corpus data identifying n-grams of the large unannotated corpus identifying names and n-grams not identifying names. The method also includes generating a name detection model, including deriving a name model using the annotated semi-structured data identifying names and the large annotated corpus data identifying names, deriving a not-name model using the semi-structured data not identifying names, and deriving a language model using the large annotated corpus. Other embodiments of this aspect include systems and computer program products.

Implementations of the aspect can include one or more of the following features. The aspect can further include applying the name detection model to the collection of semi-structured data to form the annotated semi-structured data, the annotated semi-structured data identifying n-grams identifying names and n-grams not identifying names, applying the name detection model to the large unannotated corpus to form the large annotated corpus data identifying n-grams of the large unannotated corpus identifying names and n-grams not identifying names, and generating a refined name detection model. Generating the refined name detection model can include deriving a refined name model using the annotated semi-structured data identifying names and the large annotated corpus data identifying names, deriving a refined not-name model using the semi-structured data not identifying names, and deriving a refined language model using the large annotated corpus.

The name model can include a collection of n-grams from the annotated semi-structured data identifying names and the large annotated corpus identifying names, where each n-gram includes a family name as a left character and a given name as right context, and each n-gram has a corresponding probability of identifying a name. The not-name model can include a collection of n-grams from the annotated semi-structured data not identifying names, where each n-gram includes a family name as a left character and a given name as right context, and each n-gram has a corresponding probability of not identifying a name. The raw name model can include a collection of n-grams from the annotated corpus, where each n-gram includes a left character that is a family name from the collection of family names, and each n-gram has a corresponding probability of identifying a name according to a relative frequency of the name in the annotated corpus. The raw name model can be generated using a collection of foreign family names.

The collection of family names can include a plurality of sparse family names, and the raw name detection model uses a single probability of all sparse family names in place of a calculated probability of a specific sparse family name of the plurality of sparse family names to identify probabilities of each n-gram that includes a left character that is a sparse family name identifying a name. The collection of family names can include a plurality of foreign family names.

In general, in one aspect, a method is provided. The method includes receiving an input string of characters and applying a name detection model to the input string having a plurality of characters. Applying the name detection model includes identifying a most likely segmentation of the plurality of characters where the plurality of characters do not include one or more names, detecting one or more sequences of characters of the plurality of characters as potentially identifying one or more names, identifying a segmentation of the plurality of characters where the plurality of characters include the one or more potential names, and segmenting the plurality of characters as including the one or more names when the likelihood of the segmentation including the potential one or more names is greater than the likelihood of the most likely segmentation not including one or more names. Other embodiments of this aspect include systems and computer program products.

Implementations of the aspect can include one or more of the following features. The aspect can further include detecting one or more names when the plurality of characters is segmented as including one or more names. The aspect can further include receiving a string including a plurality of characters and calculating a probability that a particular sequence of the string identifies a name, the name including a family name and a given name, including: when the frequency of the particular sequence in a corpus is less than a threshold value, determining the probability that the particular sequence identifies a name as a function of a relative frequency that the portion of the sequence representing a given name occurs with any family name and the relative frequency of the portion of the sequence representing the family name.

The aspect can further include receiving user input data and applying the raw name detection model to the user input data to form annotated user input data, the annotated user input data identifying n-grams identifying names and n-grams not identifying names. Generating the name detection model can further include deriving the name model using the annotated user input data identifying names, deriving the not-name model using the annotated user input data not identifying names, and deriving a language model using the annotated user input data.

Particular embodiments of the subject matter described in this specification can be implemented to realize one or more of the following advantages. CJK name detection can be performed with or without pre-segmenting input text into words, preventing word segmentation errors that result in name detection errors. Training a name detection model does not require large amounts of human annotated data. Some training data can be applied to semi-structured data (e.g., descriptions of downloads in XML files). A vast amount of unannotated data, in particular, input method editor (IME) user inputs, IME user dictionaries, web pages, search query logs, emails, blogs, instant message (IM) scripts, and news articles can be used to train the name detection model. The use of this data guarantees both high precision and high recall in name detection. A name detection model can also be used to detect names with sparse family names and foreign names. In addition, CJK name detection includes iterative training to further refine the name detection model to detect names added to the previous name detection model.

The details of one or more embodiments of the subject matter described in this specification are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages of the subject matter will become apparent from the description, the drawings, and the claims.

BRIEF DESCRIPTION OF THE DRAWINGS

FIGS. 1A-1B show example Chinese text.

FIG. 2 is a block diagram illustrating an example generation of a raw name detection model.

FIG. 3 is a block diagram illustrating an example generation of a name detection model.

FIG. 4 is a block diagram illustrating components of an example name detection model.

FIG. 5 is a block diagram illustrating an example hidden Markov model for an observed sequence of Chinese characters.

FIG. 6 is a flow chart showing an example process for detecting names.

FIG. 7 is an example system for CJK name detection.

Like reference numbers and designations in the various drawings indicate like elements.

DETAILED DESCRIPTION

Formula for Detecting Names

Some words in Chinese can either be a person's name or a regular word, e.g., a single character sequence can be either the name of a famous singer in China or a common word meaning daybreak. For example, FIG. 1A shows Chinese text that includes a character sequence 100 meaning daybreak. As another example, FIG. 1B shows Chinese text that includes a character sequence 102 as the name of the famous singer in China. These character sequences can be classified as either identifying names or not identifying names.

In particular, n-grams are classified as either identifying names or not identifying names. A given n-gram w can be classified as either identifying a name (NAME) or not identifying a name (NOTNAME) using Bayes' rule. Bayes' rule provides that the probability of a given n-gram w identifying a name given context can be defined as:

$P(w = \mathrm{NAME} \mid \mathrm{context}) = \frac{P(w = \mathrm{NAME}, \mathrm{context})}{P(\mathrm{context})}.$

Similarly, the probability of a given n-gram not identifying a name can be defined as:

$P(w = \mathrm{NOTNAME} \mid \mathrm{context}) = \frac{P(w = \mathrm{NOTNAME}, \mathrm{context})}{P(\mathrm{context})}.$

Furthermore, a ratio can be defined as:

$\mathrm{ratio} = \frac{P(w = \mathrm{NAME} \mid \mathrm{context}) \, L(\mathrm{NAME} \Rightarrow \mathrm{NOTNAME})}{P(w = \mathrm{NOTNAME} \mid \mathrm{context}) \, L(\mathrm{NOTNAME} \Rightarrow \mathrm{NAME})}.$

In one implementation, if the resulting ratio value is greater than one, then the n-gram is classified as identifying a name. In other words, a cost-weighted likelihood that the n-gram w identifies a name is greater than a cost-weighted likelihood that the n-gram w does not identify a name. Otherwise, the n-gram is classified as not identifying a name. L represents a particular loss function. In some implementations, the loss function is a constant such that the equation can be simplified as:

$\mathrm{ratio} = c \cdot \frac{P(w = \mathrm{NAME}, \mathrm{context})}{P(w = \mathrm{NOTNAME}, \mathrm{context})},$

where c is a constant. The joint probabilities, P(w=NAME, context) and P(w=NOTNAME, context), can be provided as output from a name detection model, as described in further detail below with respect to FIGS. 2-4.
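The simplified ratio test can be sketched in a few lines of Python. The function name `classify_ngram` and the handling of a zero not-name probability are assumptions made for illustration; the text specifies only that a ratio greater than one yields NAME.

```python
def classify_ngram(p_name_joint, p_notname_joint, c=1.0):
    """Classify an n-gram from its joint probabilities using the constant-loss ratio.

    p_name_joint    ~ P(w=NAME, context)
    p_notname_joint ~ P(w=NOTNAME, context)
    c               ~ the constant arising from a constant loss function
    """
    if p_notname_joint == 0.0:
        # Assumed convention (not from the text): avoid dividing by zero and
        # treat any positive name probability as evidence of a name.
        return "NAME" if p_name_joint > 0.0 else "NOTNAME"
    ratio = c * p_name_joint / p_notname_joint
    return "NAME" if ratio > 1.0 else "NOTNAME"


# Illustrative probabilities only.
label_a = classify_ngram(0.002, 0.001)         # ratio is 2
label_b = classify_ngram(0.001, 0.002)         # ratio is 0.5
label_c = classify_ngram(0.001, 0.002, c=3.0)  # ratio is 1.5
```

The third call shows how the constant c shifts the decision boundary: the same probabilities that yield NOTNAME with c = 1 yield NAME when misclassifying a name is weighted three times as heavily.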

As an initial overview, the name detection model includes a name model, a not-name model, and a language model. A raw name model is generated using a pre-defined collection of family names and an annotated corpus to identify whether n-grams in an annotated corpus identify a name. The raw name model is applied to semi-structured data and large unannotated data to generate a name detection model.

In particular, the name detection model derives probability estimates from P(w=NAME, context) and P(w=NOTNAME, context). Specifically, the joint probability P(w=NAME, context) can be rewritten as:

P_(name)(W,context)=P_(name)(prefix)P_(name)(W|prefix)P_(name)(suffix|W,prefix).

P_(name)(W, context) can be further approximated as:

P_(name)(prefix)P_(name)(family_name,given_name|prefix)P_(name)(suffix|family_name,given_name)  Expression [1]

In addition, the joint probability P(w=NOTNAME, context) can be similarly approximated as:

P_(notname)(prefix)P_(notname)(family_name,given_name|prefix)P_(notname)(suffix|family_name,given_name)  Expression [2]

Raw Name Detection Model

FIG. 2 is a block diagram 200 illustrating an example generation of a raw name detection model 206. For convenience, generation of the raw name detection model 206 will be described with respect to a system that performs the generation.

In CJK text, a given n-gram can identify a name only if the left character in the n-gram is a family name. The right context is a given name. Therefore, a pre-defined collection of family names 204 is used to generate the raw name detection model 206. The system can generate the raw name detection model 206 using a small amount of annotated training data. The system trains the raw name model 206 by using an annotated corpus (e.g., small annotated corpus 202) and a collection of pre-defined family names 204.

The pre-defined family names 204 include a collection of family names in one or more CJK languages. For example, for a Chinese name detection model, the pre-defined family names 204 can include a collection of 300 common Chinese family names, which statistically cover 99% or more of possible Chinese family names in a given population. The small annotated corpus 202 includes a small collection of text data, for example, web documents or search queries. The text data of the small annotated corpus 202 includes n-grams that have been identified (e.g., annotated) as identifying names or not identifying names. For example, the names can be manually identified by one or more individuals.

After generation, the raw name detection model 206 includes probability estimates calculated as a function of relative frequencies of n-grams in the small annotated corpus 202, with left characters that are found in the collection of family names 204, identifying names and n-grams not identifying names. Thus, the raw name model 206 can be used to calculate the probability that an input n-gram identifies a name or does not identify a name (e.g., to detect a name based on the ratio described above). However, this is limited by the probabilities of the small annotated corpus, which may not be accurate over a large collection of data. As a result, the raw name detection model 206 is further applied to training data to generate a name detection model, as discussed in further detail below with respect to FIG. 3.
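One way to picture the raw model is as a table of relative frequencies over annotated n-grams whose left character is a known family name. The Python sketch below is a simplified illustration under that assumption; `train_raw_model`, the `(ngram, is_name)` annotation format, and the placeholder Latin characters standing in for CJK characters are all hypothetical.

```python
from collections import Counter


def train_raw_model(annotated_ngrams, family_names):
    """Estimate, per n-gram, the relative frequency with which it was annotated as a name.

    annotated_ngrams: iterable of (ngram, is_name) pairs from a small annotated corpus.
    family_names: set of single-character family names (the pre-defined collection).
    Only n-grams whose left character is a family name are kept.
    """
    name_counts, total_counts = Counter(), Counter()
    for ngram, is_name in annotated_ngrams:
        if not ngram or ngram[0] not in family_names:
            continue  # cannot identify a name: left character is not a family name
        total_counts[ngram] += 1
        if is_name:
            name_counts[ngram] += 1
    return {ng: name_counts[ng] / total for ng, total in total_counts.items()}


# Placeholder data: "x" stands in for a family-name character.
annotations = [("xy", True), ("xy", True), ("xy", False), ("xq", False), ("ab", True)]
raw_model = train_raw_model(annotations, family_names={"x"})
```

Here “xy” was annotated as a name in two of its three occurrences, “xq” never was, and “ab” is excluded because “a” is not in the family-name collection.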

Training Data

FIG. 3 is a block diagram 300 illustrating an example generation of a name detection model 314. An annotation process 316 (e.g., performed by the raw name detection model 206) is applied to unannotated data in order to generate an expanded name detection model. Semi-structured data 302 and a large unannotated corpus 308 can be used as the unannotated data.

Semi-structured data 302 can include, for example, XML files. The semi-structured data 302 can include data having a number of different fields. The particular fields can be used to identify names and not names. For example, the semi-structured data 302 can include XML files identifying music information where one of the fields is an artist field.

The large unannotated corpus 308 provides a collection of text in a target language (e.g., Chinese, Japanese, or Korean). The large unannotated corpus 308 can include a number of different sources of text, including, e.g., web queries, web pages, and news articles. In some implementations, the large unannotated corpus 308 includes text on the order of tens to hundreds of billions of characters, or even more.

The annotation process 316 is applied and forms subsets of training data used to train sub-models of the name detection model 314. In particular, the probability estimates of n-grams identifying names and n-grams not identifying names, determined from the small annotated corpus 202 to generate the raw name detection model 206, are used to separate the training data into training data that identifies names and training data that does not identify names.

The system applies annotation process 316 to the semi-structured data 302 to form annotated semi-structured data (e.g., 304 and 306). In particular, the raw name detection model 206 is used to separate the semi-structured data 302 and form a subset of annotated semi-structured data that includes n-grams identifying names 304, and form a subset of annotated semi-structured data that includes n-grams not identifying names 306. For example, if an XML file contains an n-gram “artist: c1 c2 c3”, where “c1 c2 c3” is a CJK name, the n-gram is placed in the subset of annotated semi-structured data that includes n-grams identifying names 304. As another example, if the XML file also contains an n-gram “title: c4 c5”, where “c4 c5” does not identify a name (e.g., the title of a song), the n-gram is placed in the subset of annotated semi-structured data that includes n-grams not identifying names 306.

The system also applies annotation process 316 to a large unannotated corpus 308 to form large annotated corpus data (e.g., 310 and 312). In particular, the raw name detection model 206 is used to separate the large unannotated corpus into a set of large annotated corpus data that includes n-grams identifying names 310, and a set of large annotated data that includes n-grams not identifying names 312. For example, if a web page sentence includes the character sequence “c1 c2 c3 c4 c5 c6”, where “c2 c3 c4” is a CJK name, then the sentence is placed into the set of large annotated corpus data that includes n-grams identifying names 310. Alternatively, if the annotation process 316, when applied to the sentence, does not detect a name, the sentence is placed into the set of large annotated corpus data that includes n-grams not identifying names 312.
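The separation of an unannotated corpus into sentences with and without detected names might look like the following Python sketch. It is an illustration only: `annotate_corpus` and the `is_name` callback are hypothetical, the callback stands in for whatever decision the raw name detection model makes, and placeholder Latin characters stand in for CJK characters.

```python
def annotate_corpus(sentences, family_names, is_name):
    """Split sentences into those containing a detected name and those that do not.

    family_names: set of single-character family names.
    is_name: callable taking a candidate 2-gram or 3-gram and returning True or
             False; assumed here to wrap the raw name detection model.
    Returns two lists of (sentence, detected_names) pairs.
    """
    with_names, without_names = [], []
    for sentence in sentences:
        detected = [
            sentence[i:i + n]
            for i, ch in enumerate(sentence)
            if ch in family_names          # left character must be a family name
            for n in (2, 3)                # names are treated as bigrams or trigrams
            if i + n <= len(sentence) and is_name(sentence[i:i + n])
        ]
        target = with_names if detected else without_names
        target.append((sentence, detected))
    return with_names, without_names


# Placeholder data: "x" is a family-name character and "xyz" is a known name.
known_names = {"xyz"}
named, unnamed = annotate_corpus(
    ["abxyzc", "abcdef"],
    family_names={"x"},
    is_name=lambda candidate: candidate in known_names,
)
```

In this toy run the first sentence lands in the set with names (with “xyz” detected) and the second in the set without names.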

A training process 318 uses the annotated semi-structured data (e.g., 304 and 306) and large annotated corpus data (e.g., 310 and 312) to generate a name detection model 314, as discussed below in further detail with respect to FIG. 4.

In some implementations, the unannotated data can include user input data, including, e.g., scripts of IME and user-edited lists of words or phrases. The system applies annotation process 316 to the user input data to form annotated user input data identifying names and annotated user input data not identifying names. The training process 318 then uses the annotated user input data to generate the name detection model 314.

Name Detection Model

FIG. 4 is a block diagram illustrating components of an example name detection model 314. The name detection model 314 includes a name model 402, a not-name model 404, and a language model 406.

Name Model

The subset of semi-structured data that includes n-grams identifying names 304 and the set of large annotated corpus data that includes n-grams identifying names 310 are used to derive a name model 402. The system uses these sets of data to determine the probability that an n-gram including a family name and a given name identifies a name, or:

P_(name)(family_name,given_name).

In particular, the subset of semi-structured data that includes n-grams identifying names 304 and the set of large annotated corpus data that includes n-grams identifying names 310 are used to generate probability estimates as a function of the relative frequencies of n-grams identifying names occurring in the sets of data.

In some implementations, the annotated user input is used to generate the probability estimates.

Not-Name Model

The subset of semi-structured data that includes n-grams not identifying names 306 is used to derive a not-name model 404. The system uses this subset of data to determine the probability that an n-gram including a family name and a given name does not identify a name, or:

P_(notname)(family_name,given_name).

In particular, this subset of data is used to generate probability estimates as a function of the relative frequencies of n-grams not identifying names in the subset of data.

In some implementations, the annotated user input is used to generate the probability estimates.

Language Model

The sets of large annotated data (e.g., 310 and 312) are used to derive a language model 406. The system uses these sets of data to determine the probability that an n-gram identifies a name or does not identify a name using context of the n-gram. Specifically, the system determines the probabilities that a suffix identifies a name given a name candidate and a name candidate identifies a name given a prefix, or:

P_(name)(suffix|name) and

P_(name)(name|prefix),

to derive a language sub-model with names.

Furthermore, the system determines the probabilities that a suffix does not identify a name given a name candidate and a name candidate does not identify a name given a prefix, or:

P_(notname)(suffix|name) and

P_(notname)(name|prefix),

to derive a language sub-model without names.

A prefix is one or more characters of a sequence of characters that precedes an n-gram name candidate. A suffix is one or more characters of a sequence of characters that follows an n-gram name candidate. For example, for the sequence of characters “c1 c2 c3 c4 c5 c6 c7” where the name candidate is “c3 c4 c5”, the prefix is “c1 c2” and the suffix is “c6 c7”.
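The prefix and suffix in the example above can be extracted with simple slicing, as in this minimal Python sketch. The helper name `split_context` and the default window of two characters are assumptions for illustration; the text says only that a prefix or suffix is one or more characters.

```python
def split_context(chars, start, length, window=2):
    """Return (prefix, candidate, suffix) for a name candidate inside a sequence.

    chars: sequence of characters (tokens).
    start: index of the first character of the name candidate.
    length: number of characters in the name candidate.
    window: assumed maximum number of context characters on each side.
    """
    candidate = chars[start:start + length]
    prefix = chars[max(0, start - window):start]
    suffix = chars[start + length:start + length + window]
    return prefix, candidate, suffix


# The seven-character example from the text, with "c3 c4 c5" as the candidate.
sequence = ["c1", "c2", "c3", "c4", "c5", "c6", "c7"]
prefix, candidate, suffix = split_context(sequence, start=2, length=3)
```

For the example sequence this yields the prefix “c1 c2”, the candidate “c3 c4 c5”, and the suffix “c6 c7”. A candidate at the very start or end of the sequence simply gets an empty prefix or suffix.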

The set of large annotated data that includes n-grams identifying names 310 is used to generate probability estimates as a function of relative frequencies of n-grams being names in the set of data given a particular prefix or suffix. Also, the set of large annotated data that includes n-grams not identifying names 312 is used to generate probability estimates as a function of relative frequencies of n-grams not being names in the set of data given a particular prefix or suffix.

In some implementations, the annotated user input is used to generate the probability estimates.

In summary, the raw name detection model 206 is used in an annotation process 316 to separate the semi-structured data 302 and the large unannotated corpus 308 and form annotated semi-structured data (304 and 306) and a large annotated corpus (310 and 312). The system uses this annotated data and a training process 318 to train name detection model 314 including name model 402, not-name model 404, and language model 406.

Refined Formula for Detecting Names

The probability estimates from the name model and the language model are used to determine P(NAME|context). For example, if a sequence of characters is “c1 c2 c3 c4 c5 c6 c7”, and “c3 c4 c5” is a name, then the probability that “c3 c4 c5” is a name given the context (i.e., prefix is “c1 c2”, and suffix is “c6 c7”), or P(NAME|context), can be derived from Expression [1] above. P(NAME|context) can be expressed as:

P_(name)(c3|prefix)P_(name)(c4c5|c3)P_(name)(suffix|c3,c4c5).

This expression can be rewritten generically as:

P_(name)(family_name|prefix)P_(name)(given_name|family_name)P_(name)(suffix|family_name,given_name),

where

$P_{name}(\mathrm{given\_name} \mid \mathrm{family\_name}) = \frac{P_{name}(\mathrm{family\_name}, \mathrm{given\_name})}{P_{name}(\mathrm{family\_name})}.$

As described above, the name model can be trained to determine P_(name)(family_name, given_name). Furthermore, the language model can be trained to determine P_(name)(family_name|prefix) and P_(name)(suffix|family_name, given_name).
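The generic factorization can be sketched as a product of three table lookups, as in the Python example below. The dictionary layout (`family_given_prefix`, `name`, `family`, `suffix_given_name`) and the function `name_score` are hypothetical structures chosen for this illustration, and the probabilities are made-up values.

```python
def name_score(prefix, family_name, given_name, suffix, model):
    """Evaluate P(family|prefix) * P(given|family) * P(suffix|family, given).

    model is an assumed dictionary of probability tables:
      model["family_given_prefix"][(family_name, prefix)]         ~ P(family_name | prefix)
      model["name"][(family_name, given_name)]                    ~ P(family_name, given_name)
      model["family"][family_name]                                ~ P(family_name)
      model["suffix_given_name"][(suffix, family_name, given_name)] ~ P(suffix | family_name, given_name)
    Missing entries are treated as probability zero (no smoothing).
    """
    p_family = model["family"].get(family_name, 0.0)
    if p_family == 0.0:
        return 0.0
    # P(given_name | family_name) = P(family_name, given_name) / P(family_name)
    p_given = model["name"].get((family_name, given_name), 0.0) / p_family
    p_family_ctx = model["family_given_prefix"].get((family_name, prefix), 0.0)
    p_suffix = model["suffix_given_name"].get((suffix, family_name, given_name), 0.0)
    return p_family_ctx * p_given * p_suffix


# Made-up tables for the example "c1 c2 | c3 c4 c5 | c6 c7".
demo_model = {
    "family_given_prefix": {("c3", "c1c2"): 0.5},
    "name": {("c3", "c4c5"): 0.125},
    "family": {"c3": 0.25},
    "suffix_given_name": {("c6c7", "c3", "c4c5"): 0.25},
}
score = name_score("c1c2", "c3", "c4c5", "c6c7", demo_model)
```

With these values the score is 0.5 × (0.125 / 0.25) × 0.25 = 0.0625. The same function could be applied to tables estimated from data not identifying names to obtain the corresponding NOTNAME score.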

The probability estimates from the not-name model and the language model are used to determine P(NOTNAME|context) in a similar manner. For example, if a sequence of characters is “c1 c2 c3 c4 c5 c6 c7”, and “c3 c4 c5” is not a name, then the probability that “c3 c4 c5” is not a name given the context (i.e., prefix is “c1 c2”, and suffix is “c6 c7”), or P(NOTNAME|context), can be derived from Expression [2] above. P(NOTNAME|context) can be expressed as:

P_(notname)(c3|prefix)P_(notname)(c4c5|c3)P_(notname)(suffix|c3,c4c5).

This expression can be rewritten generically as:

P_(notname)(family_name|prefix)P_(notname)(given_name|family_name)P_(notname)(suffix|family_name,given_name).

As described above, the not-name model can be trained to determine P_(notname)(family_name, given_name). Furthermore, the language model can be trained to determine P_(notname)(family_name|prefix) and P_(notname)(suffix|family_name, given_name).

Training Iterations

In some implementations, the name detection model 314 is further used to separate the semi-structured data 302 and the large unannotated corpus 308 into annotated semi-structured data (304 and 306) and a large annotated corpus (310 and 312). For example, in FIG. 3, name detection model 314 is used in the annotation process 316 to separate the semi-structured data 302 and large unannotated corpus 308. In some implementations, these new sets of training data are used to generate a more refined name detection model. The more refined name detection model has greater coverage than the raw name detection model due to the use of larger training data to derive probability estimates of n-grams either identifying names or not identifying names.

In some implementations, the annotated user input is used to generate the more refined name detection model.

Further refinements of the name detection model can be achieved by training the name detection model in two or more iterations. Each iteration enhances the coverage of the name model. In some implementations, a number of iterations (e.g., three iterations) can be specified. Alternatively, the number of iterations can be based on conditions, for example, the condition that the probability estimates provided as output by the name detection model do not change more than a threshold between iterations.
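The two stopping rules described above (a fixed number of iterations, or probability estimates that change by no more than a threshold) can be combined in a loop like the following Python sketch. The function `train_iteratively`, the representation of a model as a dictionary of probability estimates, and the `retrain` callback are assumptions made for illustration.

```python
def train_iteratively(initial_model, retrain, max_iterations=3, threshold=1e-3):
    """Re-annotate and retrain until estimates stabilize or the iteration cap is hit.

    initial_model: dict mapping n-grams to probability estimates.
    retrain: callable that takes the current model, re-annotates the training
             data with it, and returns a newly trained model (assumed interface).
    Returns the final model and the number of iterations performed.
    """
    if max_iterations < 1:
        raise ValueError("max_iterations must be at least 1")
    model = initial_model
    for iteration in range(1, max_iterations + 1):
        new_model = retrain(model)
        keys = set(model) | set(new_model)
        # Largest absolute change in any probability estimate.
        change = max(
            (abs(new_model.get(k, 0.0) - model.get(k, 0.0)) for k in keys),
            default=0.0,
        )
        model = new_model
        if change <= threshold:
            break
    return model, iteration


# Stand-in for real retraining: each pass moves every estimate halfway toward 1.0.
def retrain_demo(model):
    return {ngram: (p + 1.0) / 2.0 for ngram, p in model.items()}


final_model, iterations = train_iteratively(
    {"xy": 0.0}, retrain_demo, max_iterations=10, threshold=0.1
)
```

In this toy run the per-iteration changes are 0.5, 0.25, 0.125, and 0.0625, so the loop stops after the fourth iteration, when the change first falls to or below the threshold of 0.1.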

Further Refinements to the Name Detection Model

The relative frequency can be low for particular names (e.g., sparse names, sparse family names, or foreign names that have a low frequency of occurrence in the training data). As a result, the corresponding probability estimates can be inaccurate. This results in additional sparse data problems. Therefore, smoothing techniques can be used to account for low frequency, or sparse, names. If the frequency of a sequence of characters occurring in the training data is less than a threshold, smoothing techniques can be used.

Sparse Names

In some implementations, the probability of a name occurring is independent of the probabilities of a family name occurring and a given name occurring. For example, if “y” is a given name for a family name “x”, then a name is “x y”. Furthermore, “z” can be a sparse family name. Name “z y” represents the sparse family name “z” and a given name “y”, where the sparse family name “z” was not sampled or was sampled at a low frequency (e.g., below a specified threshold frequency). In one implementation, the system uses the probability of “x y” to approximate the probability of “z y”. In particular, the probabilities of the event that “x” is a family name and the event that “y” is a given name are treated independently.

As a result, the probability of a given name “y” occurring given a sparse family name “z”, or P(y|z), can be approximated in terms of statistics of “x y”, where:

$P(y \mid z) = \frac{\#\,xy}{\#\,x} \cdot \frac{\#\,z}{\#\,\mathrm{all\_family\_names}}.$

For example, if the frequency of “z y” in the training data is less than a threshold, the probability that “z y” is a name is a function of the probability that “y” is a given name for any name and the probability of the family name “z” occurring.
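The approximation above can be worked through with counts, as in this Python sketch. It follows the formula literally, taking the well-sampled family name “x” as a caller-supplied reference; the function name, the count tables, the threshold value, and the choice of a single reference family name (rather than pooling over all family names) are assumptions for illustration.

```python
def given_name_probability(family, given, pair_counts, family_counts,
                           reference_family, threshold=5):
    """Estimate P(given | family), falling back to the sparse-name approximation.

    pair_counts[(family, given)]: count of the family/given pair in the training data.
    family_counts[family]: count of the family name in the training data.
    reference_family: a well-sampled family name "x" whose statistics stand in
                      for the sparse family name (assumed to be chosen by the caller).
    """
    pair_count = pair_counts.get((family, given), 0)
    if pair_count >= threshold and family_counts.get(family, 0) > 0:
        # Enough data: use the direct relative frequency.
        return pair_count / family_counts[family]
    total_family = sum(family_counts.values())   # "# all_family_names"
    ref_total = family_counts.get(reference_family, 0)
    if ref_total == 0 or total_family == 0:
        return 0.0
    ref_pair = pair_counts.get((reference_family, given), 0)
    # (# x y / # x) * (# z / # all_family_names)
    return (ref_pair / ref_total) * (family_counts.get(family, 0) / total_family)


# Made-up counts: "z" is sparse, "x" is well sampled, "w" is another family name.
pair_counts = {("x", "y"): 50, ("z", "y"): 1}
family_counts = {"x": 200, "z": 4, "w": 796}
p_sparse = given_name_probability("z", "y", pair_counts, family_counts, reference_family="x")
p_dense = given_name_probability("x", "y", pair_counts, family_counts, reference_family="x")
```

With these counts the sparse estimate is (50 / 200) × (4 / 1000) = 0.001, while the well-sampled pair “x y” uses its direct relative frequency of 0.25.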

For example, returning to the refined formula for detecting names, the probability P_(notname)(suffix|family_name, given_name) may not be precisely estimated. In some implementations, a back-off strategy can be implemented such that P_(notname)(suffix|family_name, given_name) can be expressed as:

BackoffWeight(family_name, given_name) P_(notname)(suffix|all_family_names, given_name).

Sparse Family Names

In some implementations, a probability of all sparse family names is used as a substitute for a probability of a single sparse family name. For example, suppose “a” is a given name and “b” is a family name. The probability of a name occurring given context can be represented by P(a|b)P(b|context). If “b” is a sparse family name, the probability P(a|b) can be inaccurate. In this implementation, the probability of a name occurring in a given context is more accurately represented by using the probability that “a” occurs in the training data given all sparse family names multiplied by the probability that all sparse family names occur in the training data given the context, or:

P(a|all_sparse_family_names)P(b|all_sparse_family_names)P(all_sparse_family_names|context).

Foreign Name Detection Model

The relative frequency of foreign names (e.g., translated names) can also be low and result in inaccurate probability estimates. Therefore, a foreign name detection model can be generated according to the same steps described above with respect to generating a name detection model 314. In particular, a raw foreign name detection model is generated from a pre-defined collection of foreign last names in a similar manner as generating the raw name detection model 206. The raw foreign name detection model can be applied to other data (e.g., large unannotated data and semi-structured data) to generate a foreign name detection model in a similar manner as generating the name detection model 314.

Segmentation

When using the name detection model to detect names for a given input sequence of n-grams, the probability estimates of the n-grams either identifying names or not identifying names are used to segment sequences of characters into words and simultaneously detect names.

In some implementations, a sequence of CJK characters is arranged in a hidden Markov model. A hidden Markov model is a statistical model that includes hidden parameters and observable parameters. For example, the observable parameters are the sequence of CJK characters, and the hidden parameters are possible sequences of CJK words. Specifically, particular sequences of characters in CJK can result in one or more sequences of words because CJK characters or combinations of CJK characters can have different meanings. For example, a sequence of characters “c1 c2 c3” is a possible sequence of a CJK word. In addition, “c1 c2” can also be a possible sequence of another CJK word.

In some implementations, a Viterbi algorithm is used to segment the hidden Markov model. The Viterbi algorithm is a dynamic programming algorithm for finding the most likely sequence of hidden states (e.g., segmentation paths) that results in a sequence of observed events. For example, the Viterbi algorithm is used to find the most likely sequence of CJK words that results in a sequence of CJK characters.

The most likely sequence of CJK words can be written as:

$\underset{W}{\arg\max}\; P(W \mid C),$

which describes the sequence of CJK words, W, out of all possible sequences of CJK words, that provides the highest value for P(W|C), where W = w₁, w₂, . . . w_(M) and C is a sequence of CJK characters represented by C = c₁, c₂, . . . c_(L). Additionally, Bayes Rule provides that:

${P( {WC} )} = {\frac{{P(W)}{P( {CW} )}}{P(C)}.}$

The language model provides P(W). Because P(C) does not depend on W, it can be dropped from the maximization. Using Bayes Rule, the most likely sequence of CJK words given a sequence of CJK characters can therefore be re-written as:

$\underset{W}{\arg\max}\; P(W \mid C) = \underset{W}{\arg\max}\; P(C \mid W)\,P(W).$

Consequently, the most likely W (i.e., the most likely sequence of CJK words) is one that maximizes the product of the probability that W occurs and the probability that W would consist of C (i.e., the probability that a given sequence of CJK words would map onto a sequence of CJK characters).
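The maximization above can be illustrated with a small dynamic-programming sketch. It rests on two simplifying assumptions that are not taken from this specification: P(C|W) is treated as 1 whenever the words of W concatenate to C (and 0 otherwise), and P(W) is a unigram product of per-word probabilities rather than a full n-gram language model. The function `segment`, the toy lexicon, and the Latin letters standing in for CJK characters are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def segment(chars: str, word_prob: Dict[str, float],
            max_word_len: int = 4,
            unknown_prob: float = 1e-6) -> Tuple[List[str], float]:
    """Return the most likely word sequence for `chars` and its log probability."""
    n = len(chars)
    best_logp = [-math.inf] * (n + 1)  # best score of any segmentation of chars[:i]
    back = [0] * (n + 1)               # start index of the last word on that best path
    best_logp[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            word = chars[start:end]
            if word in word_prob:
                p = word_prob[word]
            elif len(word) == 1:
                p = unknown_prob       # unseen single characters stay reachable
            else:
                continue               # unseen multi-character strings are not words
            score = best_logp[start] + math.log(p)
            if score > best_logp[end]:
                best_logp[end] = score
                back[end] = start
    # Follow the back-pointers from the end of the sequence to recover the words.
    words: List[str] = []
    pos = n
    while pos > 0:
        words.append(chars[back[pos]:pos])
        pos = back[pos]
    words.reverse()
    return words, best_logp[n]


# "a", "b", "c" play the role of the characters c1, c2, c3.
lexicon = {"a": 0.1, "b": 0.1, "c": 0.1, "ab": 0.05}
two_words, score_two = segment("abc", lexicon)
one_word, score_one = segment("abc", {**lexicon, "abc": 0.02})
```

With the first lexicon the best path is “ab” followed by “c”; once the three-character word is added to the lexicon, the single word “abc” becomes the most likely segmentation.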

CJK name detection detects CJK names as it is segmenting the sequence of characters into words.

Referring to FIG. 5, for example, an observed input string of CJK characters 502 includes

, where

is preceded by an identifier <S> designating the beginning of the sequence and

is followed by an identifier <E> designating the end of the sequence.

Assume that the sequence of characters

and

are words that have been previously detected in training. Further, assume that

and

are potential names (i.e., have been detected as identifying names in the training data). In the naïve model, if

and

have never been detected as words, the probability of

and

being words is low, and the sequence of characters would likely be segmented into single characters. Detecting names after this segmentation scheme results in errors.

In the naïve model, some example segmentations of words in a hidden Markov model (e.g., hidden Markov model 500) are:

-   -   <S>        <E>,    -   <S>        <E>,    -   <S>        <E>, and    -   <S>        <E>.

However, incorporating the name detection model, the sequence of characters

can be detected as characters that potentially identify a name; and the sequence of characters

can also be detected as characters that potentially identify a name. These sequences of characters have associated probabilities of being words, in the sense that the sequences of characters have associated probabilities of potentially identifying a name.

Therefore, other example segmentations of words are added to the model. In this refined hidden Markov model, additional example segmentations of words are:

-   -   <S>        <E>,    -   <S>        <E>,    -   <S>        <E>,    -   <S>        <E>; and    -   <S>        <E>,    -   <S>        <E>,    -   <S>        <B>.

Using this model, segmenting a sequence of characters includes segmenting the sequence of characters into words depending on the likelihood of the segmentation including a potential name. The introduction of other likely sequences that include potential names prevents the aforementioned segmentation errors from propagating into name detection. If a segmentation path including a name is more likely to occur than a segmentation path that does not include a name, then the segmentation path including a name is used and a name is detected. The detected sequence of characters identifying a name and its corresponding probability of identifying a name are added to the name detection model 314.
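The comparison between a path that includes a potential name and the best path that does not can be sketched as below. This is an illustration under stated assumptions, not the described implementation: word and name probabilities are placed on a single scale, paths are scored as a product of per-piece probabilities, and `best_path`, `segment_and_detect`, and the Latin letters standing in for CJK characters are hypothetical.

```python
import math


def best_path(chars, candidates, unknown_prob=1e-6):
    """Most likely segmentation given {piece: (probability, is_name)}."""
    n = len(chars)
    best = [(-math.inf, None)] * (n + 1)  # (log score, (start, is_name) of last piece)
    best[0] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(end):
            piece = chars[start:end]
            if piece in candidates:
                prob, is_name = candidates[piece]
            elif end - start == 1:
                prob, is_name = unknown_prob, False  # unseen single character
            else:
                continue
            score = best[start][0] + math.log(prob)
            if score > best[end][0]:
                best[end] = (score, (start, is_name))
    path, pos = [], n
    while pos > 0:
        start, is_name = best[pos][1]
        path.append((chars[start:pos], is_name))
        pos = start
    path.reverse()
    return path, best[n][0]


def segment_and_detect(chars, word_prob, name_prob):
    """Segment `chars`; report names only if a path with names is more likely."""
    plain = {w: (p, False) for w, p in word_prob.items()}
    refined = dict(plain)
    for candidate, p in name_prob.items():
        # A potential name enters the lattice as an additional piece.
        if p > refined.get(candidate, (0.0, False))[0]:
            refined[candidate] = (p, True)
    plain_path, plain_score = best_path(chars, plain)
    refined_path, refined_score = best_path(chars, refined)
    chosen = refined_path if refined_score > plain_score else plain_path
    words = [piece for piece, _ in chosen]
    names = [piece for piece, is_name in chosen if is_name]
    return words, names


words = {"a": 0.1, "b": 0.01, "c": 0.01, "d": 0.1}
with_name = segment_and_detect("abcd", words, {"bc": 0.02})
without_name = segment_and_detect("abcd", words, {"bc": 1e-9})
```

In the first call the path containing the potential name “bc” is more likely than the character-by-character path, so “bc” is kept as one word and reported as a name; in the second call the name candidate is too unlikely and the plain segmentation is used.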

In some implementations, the name detection model 314 is used to detect names in an input text. For example, a detector receives CJK input text and uses the name detection model 314 to simultaneously segment the CJK input text into words and detect names from the CJK input text.

FIG. 6 is a flow chart showing an example process for detecting names 600. For convenience, the process for detecting names 600 will be described with respect to a system that performs the detection. During the process for detecting names 600, the system scans a received sequence of characters from the beginning of the sequence until the end of the sequence for names.

The system receives 602 a sequence of characters (e.g., a sequence of Chinese characters). In particular, the system identifies 604 a first character of the sequence. The system determines 606 if the identified character is a family name candidate. If the character is a family name candidate (e.g., a character in the collection of family names 204), the system detects 614 names using the refined formula for detecting names (e.g., the ratio with refinements), as described above.

If the character is not a family name candidate, then the system determines 608 if the character is a prefix of a family name candidate. If the character is a prefix of a family name candidate, then the system detects 614 names using the refined formula for detecting names (e.g., the ratio with refinements), as described above.

If the character is not a prefix of a family name candidate, then the system determines 610 if the system has reached the end of the sequence of characters. Similarly, after detecting 614 names using the refined formula for detecting names, the system also determines 610 if the system has reached the end of the sequence of characters. If the system reaches the end of the sequence, the process terminates 616. If the system has not reached the end of the sequence of characters, then the system identifies 612 a next character of the sequence and repeats steps 606, 608, 610, and optionally 614 for other characters of the sequence until the end of the sequence is reached.
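The control flow of process 600 can be sketched as a single scan over the characters. The sketch below is illustrative only: `scan_for_names` and the `detect_at` callback (a stand-in for the refined formula for detecting names, which is not reproduced here) are hypothetical, and Latin letters stand in for CJK characters.

```python
def scan_for_names(chars, family_names, detect_at):
    """Scan `chars` once, invoking `detect_at` wherever a family name may start."""
    # Proper prefixes of multi-character family names (checked in step 608).
    prefixes = {name[:i] for name in family_names for i in range(1, len(name))}
    detected = []
    for position, char in enumerate(chars):   # steps 604 and 612
        is_candidate = char in family_names   # step 606
        is_prefix = char in prefixes          # step 608
        if is_candidate or is_prefix:
            detected.extend(detect_at(chars, position))  # step 614
    return detected                           # end of sequence: steps 610 and 616


# Toy detector: reports a known name that starts at the given position.
KNOWN_NAMES = ("xy", "pqr")


def toy_detector(chars, position):
    return [name for name in KNOWN_NAMES if chars.startswith(name, position)]


found = scan_for_names("axybpqr", {"x", "pq"}, toy_detector)
```

In this toy run the detector is invoked at “x” (a family name candidate) and at “p” (a prefix of the two-character family name “pq”), and at no other position.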

Example System

FIG. 7 is an example system 700 for CJK name detection. A data processing apparatus 710 can include hardware/firmware, an operating system and one or more programs, including detection program 720. The detection program 720 operates, in conjunction with the data processing apparatus 710, to effect the operations described in this specification. Thus, the detection program 720, in combination with one or more processors and computer-readable media (e.g., memory), represents one or more structural components in the system 700.

The detection program 720 can be a detection processing application, or a portion. As used here, an application is a computer program that the user perceives as a distinct computer tool used for a defined purpose. An application can be built entirely into the operating system (OS) of the data processing apparatus 710, or an application can have different components located in different locations (e.g., one portion in the OS or kernel mode, one portion in the user mode, and one portion in a remote server), and an application can be built on a runtime library serving as a software platform of the apparatus 710. Moreover, application processing can be distributed over a network 780 using one or more processors 790. For example, a language model of the detection program 720 can be distributively trained over the one or more processors 790.

The data processing apparatus 710 includes one or more processors 730 and at least one computer-readable medium 740 (e.g., random access memory, storage device, etc.). The data processing apparatus 710 can also include a communication interface 750, one or more user interface devices 760, and one or more additional devices 770. The user interface devices 760 can include display screens, keyboards, mouse, stylus, or any combination thereof.

Once programmed, the data processing apparatus 710 is operable to generate the language model, name models, and foreign name models.

Embodiments of the subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, or in computer software, firmware, or hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer program products, i.e., one or more modules of computer program instructions encoded on a tangible program carrier for execution by, or to control the operation of, data processing apparatus. The tangible program carrier can be a propagated signal or a computer-readable medium. The propagated signal is an artificially generated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus for execution by a computer. The computer-readable medium can be a machine-readable storage device, a machine-readable storage substrate, a memory device, a composition of matter effecting a machine-readable propagated signal, or a combination of one or more of them.

The term “data processing apparatus” encompasses all apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers. The apparatus can include, in addition to hardware, code that creates an execution environment for the computer program in question, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.

A computer program (also known as a program, software, software application, script, or code) can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program does not necessarily correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data (e.g., one or more scripts stored in a markup language document), in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code). A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.

The processes and logic flows described in this specification can be performed by one or more programmable processors executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).

Processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor will receive instructions and data from a read-only memory or a random access memory or both. The essential elements of a computer are a processor for performing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks. However, a computer need not have such devices. Moreover, a computer can be embedded in another device, e.g., a digital picture frame, a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, to name just a few.

Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.

To provide for interaction with a user, embodiments of the subject matter described in this specification can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user and a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input.

Embodiments of the subject matter described in this specification can be implemented in a computing system that includes a back-end component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a front-end component, e.g., a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the subject matter described in this specification, or any combination of one or more such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (“LAN”) and a wide area network (“WAN”), e.g., the Internet.

The computing system can include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.

While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.

Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.

Particular embodiments of the subject matter described in this specification have been described. Other embodiments are within the scope of the following claims. For example, the actions recited in the claims can be performed in a different order and still achieve desirable results. As one example, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In certain implementations, multitasking and parallel processing may be advantageous.

1. A method comprising: generating a raw name detection model using a collection of family names and an annotated corpus including a collection of n-grams, each n-gram having a corresponding probability of occurring as a name in the annotated corpus; applying the raw name detection model to a collection of semi-structured data to form annotated semi-structured data, the annotated semi-structured data identifying n-grams identifying names and n-grams not identifying names; applying the raw name detection model to a large unannotated corpus to form a large annotated corpus data identifying n-grams of the large unannotated corpus identifying names and n-grams not identifying names; and generating a name detection model including: deriving a name model using the annotated semi-structured data identifying names and the large annotated corpus data identifying names, deriving a not-name model using the semi-structured data not identifying names, and deriving a language model using the large annotated corpus.
 2. The method of claim 1, further comprising: applying the name detection model to the collection of semi-structured data to form the annotated semi-structured data, the annotated semi-structured data identifying n-grams identifying names and n-grams not identifying names; applying the name detection model to the large unannotated corpus to form the large annotated corpus data identifying n-grams of the large unannotated corpus identifying names and n-grams not identifying names; and generating a refined name detection model including: deriving a refined name model using the annotated semi-structured data identifying names and the large annotated corpus data identifying names, deriving a refined not-name model using the semi-structured data not identifying names, and deriving a refined language model using the large annotated corpus.
 3. The method of claim 1, wherein the name model includes: a collection of n-grams from the annotated semi-structured data identifying names and the large annotated corpus identifying names, where each n-gram includes a family name as a left character and a given name as right context, and each n-gram has a corresponding probability of identifying a name.
 4. The method of claim 1, wherein the not-name model includes: a collection of n-grams from the annotated semi-structured data not identifying names, where each n-gram includes a family name as a left character and a given name as right context, and each n-gram has a corresponding probability of not identifying a name.
 5. The method of claim 1, wherein the raw name detection model includes: a collection of n-grams from the annotated corpus, where each n-gram includes a left character that is a family name from the collection of family names, and each n-gram has a corresponding probability of identifying a name according to a relative frequency of the name in the annotated corpus.
 6. The method of claim 1, wherein the raw name model is generated using a collection of foreign family names.
 7. The method of claim 1, wherein the collection of family names includes a plurality of sparse family names; and the raw name detection model uses a single probability of all sparse family names in place of a calculated probability of a specific sparse family name of the plurality of sparse family names to identify probabilities of each n-gram, that includes a left character that is a sparse family name, identifying a name.
 8. The method of claim 1, wherein the collection of family names includes a plurality of foreign family names.
 9. A method comprising: receiving an input string of characters; and applying a name detection model to the input string having a plurality of characters, including: identifying a most likely segmentation of the plurality of characters where the plurality of characters do not include one or more names, detecting one or more sequences of characters of the plurality of characters as potentially identifying one or more names, identifying a segmentation of the plurality of characters where the plurality of characters include the one or more potential names, and segmenting the plurality of characters as including the one or more names when the likelihood of the segmentation including the potential one or more names is greater than the most likely segmentation not including one or more names.
 10. The method of claim 9 further comprising: detecting one or more names when the plurality of characters is segmented as including one or more names.
 11. The method of claim 1 further comprising: receiving a string including a plurality of characters; and calculating a probability that a particular sequence of the string identifies a name, the name includes a family name and a given name, including: when the frequency of the particular sequence in a corpus is less than a threshold value, determining the probability that the particular sequence identifies a name as a function of a relative frequency that the portion of the sequence representing a given name occurs with any family name and the relative frequency of the portion of the sequence representing the family name.
 12. The method of claim 1 further comprising: receiving user input data; and applying the raw name detection model to the user input data to form annotated user input data, the annotated user input data identifying n-grams identifying names and n-grams not identifying names; where generating the name detection model further includes: deriving the name model using the annotated user input data identifying names, deriving the not-name model using the annotated user input data not identifying names, and deriving a language model using the annotated user input data.
 13. (canceled)
 14. (canceled)
 15. A computer program product, encoded on a tangible program carrier, operable to cause data processing apparatus to perform operations comprising: generating a raw name detection model using a collection of family names and an annotated corpus including a collection of n-grams, each n-gram having a corresponding probability of occurring as a name in the annotated corpus; applying the raw name detection model to a collection of semi-structured data to form annotated semi-structured data, the annotated semi-structured data identifying n-grams identifying names and n-grams not identifying names; applying the raw name detection model to a large unannotated corpus to form a large annotated corpus data identifying n-grams of the large unannotated corpus identifying names and n-grams not identifying names; and generating a name detection model including: deriving a name model using the annotated semi-structured data identifying names and the large annotated corpus data identifying names, deriving a not-name model using the semi-structured data not identifying names, and deriving a language model using the large annotated corpus.
 16. The computer program product of claim 15, operable to cause data processing apparatus to perform operations further comprising: applying the name detection model to the collection of semi-structured data to form the annotated semi-structured data, the annotated semi-structured data identifying n-grams identifying names and n-grams not identifying names; applying the name detection model to the large unannotated corpus to form the large annotated corpus data identifying n-grams of the large unannotated corpus identifying names and n-grams not identifying names; and generating a refined name detection model including: deriving a refined name model using the annotated semi-structured data identifying names and the large annotated corpus data identifying names, deriving a refined not-name model using the semi-structured data not identifying names, and deriving a refined language model using the large annotated corpus.
 17. The computer program product of claim 15, wherein the name model includes: a collection of n-grams from the annotated semi-structured data identifying names and the large annotated corpus identifying names, where each n-gram includes a family name as a left character and a given name as right context, and each n-gram has a corresponding probability of identifying a name.
 18. The computer program product of claim 15, wherein the not-name model includes: a collection of n-grams from the annotated semi-structured data not identifying names, where each n-gram includes a family name as a left character and a given name as right context, and each n-gram has a corresponding probability of not identifying a name.
 19. The computer program product of claim 15, wherein the raw name detection model includes: a collection of n-grams from the annotated corpus, where each n-gram includes a left character that is a family name from the collection of family names, and each n-gram has a corresponding probability of identifying a name according to a relative frequency of the name in the annotated corpus.
 20. The computer program product of claim 15, wherein the raw name model is generated using a collection of foreign family names.
 21. The computer program product of claim 15, wherein the collection of family names includes a plurality of sparse family names; and the raw name detection model uses a single probability of all sparse family names in place of a calculated probability of a specific sparse family name of the plurality of sparse family names to identify probabilities of each n-gram, that includes a left character that is a sparse family name, identifying a name.
 22. The computer program product of claim 15, wherein the collection of family names includes a plurality of foreign family names.
 23. A computer program product, encoded on a tangible program carrier, operable to cause data processing apparatus to perform operations comprising: receiving an input string of characters; and applying a name detection model to the input string having a plurality of characters, including: identifying a most likely segmentation of the plurality of characters where the plurality of characters do not include one or more names, detecting one or more sequences of characters of the plurality of characters as potentially identifying one or more names, identifying a segmentation of the plurality of characters where the plurality of characters include the one or more potential names, and segmenting the plurality of characters as including the one or more names when the likelihood of the segmentation including the potential one or more names is greater than the most likely segmentation not including one or more names.
 24. The computer program product of claim 23, operable to cause data processing apparatus to perform operations further comprising: detecting one or more names when the plurality of characters is segmented as including one or more names.
 25. The computer program product of claim 15, operable to cause data processing apparatus to perform operations further comprising: receiving a string including a plurality of characters; and calculating a probability that a particular sequence of the string identifies a name, the name includes a family name and a given name, including: when the frequency of the particular sequence in a corpus is less than a threshold value, determining the probability that the particular sequence identifies a name as a function of a relative frequency that the portion of the sequence representing a given name occurs with any family name and the relative frequency of the portion of the sequence representing the family name.
 26. The computer program product of claim 15, operable to cause data processing apparatus to perform operations further comprising: receiving user input data; and applying the raw name detection model to the user input data to form annotated user input data, the annotated user input data identifying n-grams identifying names and n-grams not identifying names; wherein generating the name detection model further includes: deriving the name model using the annotated user input data identifying names, deriving the not-name model using the annotated user input data not identifying names, and deriving a language model using the annotated user input data.
 27. A system comprising: a raw name model including a collection of family names and an annotated corpus including a collection of n-grams, each n-gram having a corresponding probability of occurring as a name in the annotated corpus; annotated semi-structured data formed by applying the raw name detection model to a collection of semi-structured data to form, the annotated semi-structured data identifying n-grams identifying names and n-grams not identifying names; large annotated corpus data formed by applying the raw name detection model to a collection of a large unannotated corpus, the large annotated corpus data identifying n-grams of the large unannotated corpus identifying names and n-grams not identifying names by applying the raw name detection model; and a name detection model including: a name model derived from the annotated semi-structured data identifying names and the large annotated corpus data identifying names, a not-name model derived from the semi-structured data not identifying names, and a language model derived from the large annotated corpus.
 28. The system of claim 27, wherein: the name detection model is applied to the collection of semi-structured data to form the annotated semi-structured data, the annotated semi-structured data identifying n-grams identifying names and n-grams not identifying names; the name detection model is applied to the large unannotated corpus to form the large annotated corpus data identifying n-grams of the large unannotated corpus identifying names and n-grams not identifying names; and the system further comprises a refined name detection model including: a refined name model derived from the annotated semi-structured data identifying names and the large annotated corpus data identifying names, a refined not-name model derived from the semi-structured data not identifying names, and a refined language model derived from the large annotated corpus.
 29. The system of claim 27, wherein the name model includes: a collection of n-grams from the annotated semi-structured data identifying names and the large annotated corpus identifying names, where each n-gram includes a family name as a left character and a given name as right context, and each n-gram has a corresponding probability of identifying a name.
 30. The system of claim 27, wherein the not-name model includes: a collection of n-grams from the annotated semi-structured data not identifying names, where each n-gram includes a family name as a left character and a given name as right context, and each n-gram has a corresponding probability of not identifying a name.
 31. The system of claim 27, wherein the raw name detection model includes: a collection of n-grams from the annotated corpus, where each n-gram includes a left character that is a family name from the collection of family names, and each n-gram has a corresponding probability of identifying a name according to a relative frequency of the name in the annotated corpus.
 32. The system of claim 27, wherein the raw name model is generated using a collection of foreign family names.
 33. The system of claim 27, wherein the collection of family names includes a plurality of sparse family names; and the raw name detection model uses a single probability of all sparse family names in place of a calculated probability of a specific sparse family name of the plurality of sparse family names to identify probabilities of each n-gram, that includes a left character that is a sparse family name, identifying a name.
 34. The system of claim 27, wherein the collection of family names includes a plurality of foreign family names.
 35. A system comprising one or more computers operable to perform operations including: receiving an input string of characters; and applying a name detection model to the input string having a plurality of characters, including: identifying a most likely segmentation of the plurality of characters where the plurality of characters do not include one or more names, detecting one or more sequences of characters of the plurality of characters as potentially identifying one or more names, identifying a segmentation of the plurality of characters where the plurality of characters include the one or more potential names, and segmenting the plurality of characters as including the one or more names when the likelihood of the segmentation including the potential one or more names is greater than the most likely segmentation not including one or more names.
 36. The system of claim 35 comprising one or more computers operable to perform operations further including: detecting one or more names when the plurality of characters is segmented as including one or more names.
 37. The system of claim 27 further comprising one or more computers operable to perform operations including: receiving a string including a plurality of characters; calculating a probability that a particular sequence of the string identifies a name, the name includes a family name and a given name, including: when the frequency of the particular sequence in a corpus is less than a threshold value, determining the probability that the particular sequence identifies a name as a function of a relative frequency that the portion of the sequence representing a given name occurs with any family name and the relative frequency of the portion of the sequence representing the family name.
 38. The system of claim 27 further comprising one or more computers operable to perform operations including: receiving user input data; and applying the raw name detection model to the user input data to form annotated user input data, the annotated user input data identifying n-grams identifying names and n-grams not identifying names; wherein generating the name detection model further includes: deriving the name model using the annotated user input data identifying names, deriving the not-name model using the annotated user input data not identifying names, and deriving a language model using the annotated user input data.
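
By way of illustration only, the threshold-based estimate recited in claims 25 and 37 might be sketched as follows. The claims state only that the probability is "a function of" the two relative frequencies; the product used here, the function name `name_probability`, the `name_counts` table, and the default threshold are all hypothetical assumptions made for this sketch and are not taken from the specification. The romanized names stand in for CJK characters.

```python
from collections import Counter


def name_probability(family, given, name_counts, threshold=2):
    """Estimate the probability that family+given identifies a name.

    name_counts maps (family, given) pairs to how often each pair was
    annotated as a name (a hypothetical stand-in for the annotated corpus).
    """
    total = sum(name_counts.values())
    if total == 0:
        return 0.0

    seq_count = name_counts.get((family, given), 0)
    if seq_count >= threshold:
        # The full sequence is frequent enough: use its own relative frequency.
        return seq_count / total

    # Sequence is rarer than the threshold: back off to the two parts.
    # Relative frequency of the family name across all annotated names.
    family_rel = sum(c for (f, _), c in name_counts.items() if f == family) / total
    # Relative frequency of the given name occurring with any family name.
    given_rel = sum(c for (_, g), c in name_counts.items() if g == given) / total

    # ASSUMPTION: combine the two relative frequencies as a product.
    return family_rel * given_rel


counts = Counter({("Zhang", "Wei"): 5, ("Li", "Wei"): 3, ("Zhang", "Na"): 2})
seen = name_probability("Zhang", "Wei", counts)   # frequent pair, direct estimate
unseen = name_probability("Li", "Na", counts)     # unseen pair, backed-off estimate
```

In this sketch an unseen pairing such as ("Li", "Na") still receives a non-zero estimate because both parts have been observed in other names, which is the effect the threshold condition in the claims is directed to.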